Project Title: Machine Learning Model Development for Weather Forecasting Using Global Historical Climatology Network Daily (GHCNd) Data
Objective
The aim of this project is to analyze Global Historical Climatology Network daily (GHCNd) data to find the correlations among different features of a given location, and to use machine learning techniques to forecast one or more desired features from existing features as inputs to the learning problem.
The whole notebook will cover the following sections:
| # | Topic |
|---|---|
| 01 | Data Extraction |
| 02 | Data Analysis & Visualization |
| 03 | Data Processing |
| 04 | ML Model Development & Learning |
| 05 | Model Statistics |
| 06 | Conclusion |
I used the following Python libraries and scikit-learn estimators for this project:
| No | Library | No | Library |
|---|---|---|---|
| 01 | Pandas | 02 | Xarray |
| 03 | Numpy | 04 | Metpy |
| 05 | Matplotlib | 06 | Cartopy |
| 07 | Matplotlib 3D Axes Toolkit | 08 | Scikit-Learn |
| 09 | Datetime | 10 | Plotly Express |
| 11 | Seaborn | 12 | Matplotlib Pyplot |
| 13 | SciPy Stats | 14 | Decision Tree Regressor |
| 15 | KNeighbors Regressor | 16 | Linear Regression |
| 17 | Multi Output Regressor | 18 | Cartopy Features |
| 19 | plotly Figure Factory | 20 | Warnings |
Let us import all these necessary libraries:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import datetime as dt
import seaborn as sns
import xarray as xr
import metpy as metpy
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import plotly.express as px
import plotly.figure_factory as ff
from scipy.stats import pearsonr
import warnings as warnings
warnings.filterwarnings("ignore", category=UserWarning)
Data Extraction
For this project we will use the atmospheric dataset for QUILLAYUTE AIRPORT, WA US. The dataset is available at the National Centers for Environmental Information. Details of the station information and the data record period are given below:
| Attributes | Description |
|---|---|
| Name | QUILLAYUTE AIRPORT, WA US |
| Network ID | GHCND:USW00094240 |
| Latitude/Longitude | 47.93695°, -124.55757° |
| Elevation | 56.4 m |
Record Period
| Attributes | Description |
|---|---|
| Start Date | 1966-08-01 |
| End Date | 2025-05-03 |
| Data Coverage | 100% |
The data is stored in the NetCDF (Network Common Data Form) file format, which is used to store and share scientific data, particularly in fields like atmospheric and oceanographic research. To open the data, we need the Xarray library. We access this data from the SD Mines THREDDS Data Server.
##########################################################
#
# Touching our netCDF File
#
ghcnd_file = "http://kyrill.ias.sdsmt.edu:8080/thredds/dodsC/GHCNd/USA/WA/GHCND-USW00094240__QUILLAYUTE_AIRPORT_WA_US.nc"
ds = xr.open_dataset(filename_or_obj = ghcnd_file,
decode_timedelta = True)
del ds["time_bounds"]
display(ds)
#
##########################################################
<xarray.Dataset>
Dimensions: (time: 21421)
Coordinates:
* time (time) datetime64[ns] 1966-08-01T12...
lat float64 ...
lon float64 ...
alt float64 ...
station_id |S64 ...
Data variables: (12/41)
precipitation_amount (time) float32 ...
thickness_of_snowfall_amount (time) float32 ...
surface_snow_thickness (time) float32 ...
maximum_air_temperature (time) float32 ...
minimum_air_temperature (time) float32 ...
dew_point_temperature (time) float32 ...
... ...
weather_event_freezing_rain (time) float32 ...
weather_event_snow (time) float32 ...
weather_event_unknown_precip (time) float32 ...
weather_event_ground_fog (time) float32 ...
weather_event_ice_fog (time) float32 ...
station_name |S64 ...
Attributes:
title: Global Historical Climatology Network da...
institution: NOAA Center for Environmental Prediction...
references: https://doi.org/10.1175/JTECH-D-11-00103.1
url: https://www.ncei.noaa.gov/products/land-...
GHCN_Station_Code: GHCND:USW00094240
Station_Name: QUILLAYUTE AIRPORT, WA US
Station_Latitude: 47.93695
Station_Longitude: -124.55757
Station_Elevation_in_Meters: 56.4
Conventions: CF-1.12
featureType: timeSeries
DODS_EXTRA.Unlimited_Dimension: time
Converting the xarray Dataset into a pandas DataFrame:
# Directly convert the xarray Dataset to a pandas DataFrame and save it as .csv file
df = ds.to_dataframe()
df.to_csv('ghcnd.csv', sep = ',')
Now we want to see whether the data came from one station or multiple stations. In addition, we want to inspect the features of the dataset and check them for missing values. To do that, we will define a function that calculates, for each feature, the number of missing values, the data start and end times, the first and last missing timestamps, etc., and finally outputs a dataframe containing all that information.
print(f"Unique Stations are: {np.unique(df['station_name'])}\n",
      f"\nNumber of features in the dataset: {len(df.columns.to_list())}")
Unique Stations are: [b'QUILLAYUTE AIRPORT, WA US'] Number of features in the dataset: 45
Data Analysis and Visualization
For a time series dataset, it is very important to know the sampling frequency. In many cases, data scientists need more data than they have in hand; for such cases they use upsampling techniques. For our case, we want to check the frequency of the data and, in addition, verify whether there is any missing time step in the dataframe or whether the series is continuous.
# Infer the expected frequency (e.g., 'D' for daily, 'H' for hourly)
frequency = pd.infer_freq(df.index)
print(f"Data Frequency: {frequency}")
# Create a full expected time range
start_time = df.index.min() # DataFrame start time
end_time = df.index.max() # DataFrame end time
full_range = pd.date_range(start=start_time, end=end_time, freq=frequency)
# Find missing timestamps
missing_times = full_range.difference(df.index)
print(f"Missing Timestamps: {len(missing_times)}")
# Check continuity programmatically
is_continuous = df.index.equals(full_range)
print(f"Is the time series continuous? {is_continuous}")
Data Frequency: D Missing Timestamps: 0 Is the time series continuous? True
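Since the data is already daily and continuous, no upsampling is needed here. As an illustration of the upsampling technique mentioned above, a minimal sketch on a toy daily series (not the GHCNd data):

```python
import pandas as pd

# Toy daily series (illustrative only; not the GHCNd data).
idx = pd.date_range("2020-01-01", periods=4, freq="D")
daily = pd.Series([10.0, 12.0, 11.0, 13.0], index=idx)

# Upsample to 6-hourly resolution: resample inserts the new timestamps,
# and interpolate fills them linearly between the known daily values.
six_hourly = daily.resample("6h").interpolate(method="linear")
print(six_hourly.head())
```

The resampled series has a value every six hours, with the intermediate points linearly interpolated between adjacent daily observations.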
def Missing_Summary(dataframe):
    missing_summary = []                    # Initialize a list to store results
    start_time = dataframe.index.min()      # DataFrame start time
    end_time = dataframe.index.max()        # DataFrame end time
    # Iterate over each feature
    for column in dataframe.columns:
        if dataframe[column].isna().sum() == 0:  # Skip if the column has no missing values
            continue
        series = dataframe[column]                    # Extract the series for the current feature
        missing_times = series[series.isna()].index   # Find all times where the value is missing
        # Append to summary
        missing_summary.append({
            'Feature'           : column,
            'Start_Time'        : start_time,
            'End_Time'          : end_time,
            'First_Missing_Time': missing_times.min(),
            'Last_Missing_Time' : missing_times.max(),
            'Missing_Count'     : len(missing_times)
        })
    missing_summary_df = pd.DataFrame(missing_summary)  # Convert to DataFrame
    return missing_summary_df
missing_summary_df = Missing_Summary(df)
missing_summary_df.head(3)
| Feature | Start_Time | End_Time | First_Missing_Time | Last_Missing_Time | Missing_Count | |
|---|---|---|---|---|---|---|
| 0 | precipitation_amount | 1966-08-01 12:00:00 | 2025-03-24 12:00:00 | 1999-08-10 12:00:00 | 2023-02-12 12:00:00 | 10 |
| 1 | thickness_of_snowfall_amount | 1966-08-01 12:00:00 | 2025-03-24 12:00:00 | 1996-12-01 12:00:00 | 2025-03-24 12:00:00 | 10310 |
| 2 | surface_snow_thickness | 1966-08-01 12:00:00 | 2025-03-24 12:00:00 | 1996-12-01 12:00:00 | 2025-03-24 12:00:00 | 10339 |
Now that we know some features have missing values, we want to know their exact number and which features they are.
incomplete_features = missing_summary_df['Feature'].values.tolist()
print(f"Number of Features with missing values: {len(incomplete_features)}",
f"\nList of Missing Features {incomplete_features}")
Number of Features with missing values: 20 List of Missing Features ['precipitation_amount', 'thickness_of_snowfall_amount', 'surface_snow_thickness', 'maximum_air_temperature', 'minimum_air_temperature', 'dew_point_temperature', 'air_pressure_at_sea_level', 'surface_air_pressure', 'wet_bulb_temperature', 'mean_wind_speed', 'percent_of_possible_sunshine', 'mean_relative_humidity', 'min_relative_humidity', 'max_relative_humidity', 'mean_air_temperature', 'duration_of_sunshine', 'liquid_water_content_of_surface_snow', 'maximum_1_minute_wind_speed', 'maximum_2_minute_wind_speed', 'wind_speed_of_gust']
From the missing summary dataframe, we observe that no feature runs from the beginning of the record to the current time without any NaN values. Our original goal was to predict weather events from the available features, but due to the missing values we need to change our objective. From the observation, we see that some features have values from 1966 to 1996, whereas others have values from 2006 to 2025. We therefore created two groups of variables based on data availability. Our first dataset will contain the following features:
- precipitation_amount
- maximum_air_temperature
- minimum_air_temperature
- thickness_of_snowfall_amount
- surface_snow_thickness
With this dataset (period: 1966 to 1996), we will try to find the correlation among precipitation amount, average air temperature, snowfall thickness and surface snow thickness. Our inputs will be precipitation amount and average air temperature, and our goal is to predict the snowfall thickness and surface snow thickness.
Our second data segment (period: 2006 to 2025) will contain the following features:
- precipitation_amount
- dew_point_temperature
- air_pressure_at_sea_level
- surface_air_pressure
- maximum_air_temperature
- minimum_air_temperature
With this segment of the dataset, we will try to predict the average temperature based on precipitation, dew point temperature, and pressure values.
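The two availability windows described above can also be found programmatically. A minimal sketch on a toy frame, using pandas' `first_valid_index`/`last_valid_index` (the column names `early_var`/`late_var` are made up for illustration and do not come from the GHCNd file):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two availability windows: one variable recorded
# early in the record, another recorded only at the end.
idx = pd.date_range("2000-01-01", periods=10, freq="D")
toy = pd.DataFrame({
    "early_var": [1.0] * 6 + [np.nan] * 4,  # recorded only at the start
    "late_var":  [np.nan] * 4 + [2.0] * 6,  # recorded only at the end
}, index=idx)

# For each column, report the first and last timestamps that carry data.
windows = {col: (toy[col].first_valid_index(), toy[col].last_valid_index())
           for col in toy.columns}
for col, (start, end) in windows.items():
    print(f"{col}: {start.date()} -> {end.date()}")
```

Applied to the full GHCNd frame, the same dictionary comprehension would reveal which features share a common recording window and could be grouped together.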
Snowfall Related Data Segment Analysis
We already know the variables for the snowfall-related data segment and its recording period, so let us extract them.
var_snow = [ 'precipitation_amount',
'maximum_air_temperature',
'minimum_air_temperature',
'thickness_of_snowfall_amount',
'surface_snow_thickness']
dataset_1 = df[var_snow]
dataset_1.head(3)
| precipitation_amount | maximum_air_temperature | minimum_air_temperature | thickness_of_snowfall_amount | surface_snow_thickness | |
|---|---|---|---|---|---|
| time | |||||
| 1966-08-01 12:00:00 | 0.0 | 21.1 | 7.8 | 0.0 | 0.0 |
| 1966-08-02 12:00:00 | 0.0 | 23.9 | 13.3 | 0.0 | 0.0 |
| 1966-08-03 12:00:00 | 0.0 | 20.0 | 12.8 | 0.0 | 0.0 |
start_date = '1966-08-01'
end_date = '1996-12-01'
# Filter the DataFrame
dataset_1= dataset_1.loc[start_date:end_date].copy()
# Display the result
dataset_1.head(3)
| precipitation_amount | maximum_air_temperature | minimum_air_temperature | thickness_of_snowfall_amount | surface_snow_thickness | |
|---|---|---|---|---|---|
| time | |||||
| 1966-08-01 12:00:00 | 0.0 | 21.1 | 7.8 | 0.0 | 0.0 |
| 1966-08-02 12:00:00 | 0.0 | 23.9 | 13.3 | 0.0 | 0.0 |
| 1966-08-03 12:00:00 | 0.0 | 20.0 | 12.8 | 0.0 | 0.0 |
# Split data based on year threshold
train_end_time = '1992-12-31'
test_start_time = '1993-01-01'
# Ensure the time column is datetime type (replace 'time' with your actual date column name)
dataset_1.reset_index(inplace=True)
dataset_1['time'] = pd.to_datetime(dataset_1['time'])
dataset_1['sample'] = 'test'
dataset_1.loc[dataset_1['time'] <= train_end_time, 'sample'] = 'train'
dataset_1.head(5)
| time | precipitation_amount | maximum_air_temperature | minimum_air_temperature | thickness_of_snowfall_amount | surface_snow_thickness | sample | |
|---|---|---|---|---|---|---|---|
| 0 | 1966-08-01 12:00:00 | 0.0 | 21.100000 | 7.8 | 0.0 | 0.0 | train |
| 1 | 1966-08-02 12:00:00 | 0.0 | 23.900000 | 13.3 | 0.0 | 0.0 | train |
| 2 | 1966-08-03 12:00:00 | 0.0 | 20.000000 | 12.8 | 0.0 | 0.0 | train |
| 3 | 1966-08-04 12:00:00 | 0.0 | 17.800001 | 10.6 | 0.0 | 0.0 | train |
| 4 | 1966-08-05 12:00:00 | 0.0 | 21.100000 | 7.8 | 0.0 | 0.0 | train |
Now, we want to know how these variables and targets change over time. I will use the Plotly Express library to plot the variables and targets with an interactive range slider.
def plot(data_frame, x_column, y_column, title, y_unit):
    """Plot a time series with a unit-aware y-axis and a rangeslider."""
    fig = px.line(data_frame, x=x_column, y=y_column, title=title)
    fig.update_layout(
        yaxis_title=y_unit,
        xaxis_title='Time',
        xaxis_rangeslider_visible=True
    )
    fig.show()
    return fig  # also return the figure so the caller can save it
# Define variables, titles, and units
variables = [
('maximum_air_temperature', 'Maximum Air Temperature', '°C'),
('precipitation_amount', 'Precipitation', 'kg/m²'),
('thickness_of_snowfall_amount', 'Snowfall Thickness', 'm'),
('surface_snow_thickness', 'Surface Snow Depth', 'm')
]
# Generate and save plots with proper units
for col, title, unit in variables:
    fig = plot(dataset_1, 'time', col, title, unit)
    fig.write_html(f"{col}.html")
Now we want to know how the targets are changing concerning the input change. First, we will create a new feature by taking the average of minimum and maximum air temperature. Then we will calculate the difference in average temperature ($\Delta T$) between the consecutive days and plot the amount of surface snow thickness and snowfall amount with this $\Delta T$.
dataset_1 = (
dataset_1
.assign(Avg_T=lambda x: (x['maximum_air_temperature'] + x['minimum_air_temperature']) / 2)
# .drop(columns=['maximum_air_temperature', 'minimum_air_temperature'])
)
dataset_1_lag = dataset_1.shift(periods=-1)  # next day's row aligned to the current index
dataset_1["Delta_T"] = dataset_1["Avg_T"] - dataset_1_lag["Avg_T"]  # ΔT = today's Avg_T minus the next day's
fig = px.scatter(x = dataset_1["Delta_T"], y = dataset_1["surface_snow_thickness"], title="Variation of Surface Snow Thickness w.r.t. ΔT")
fig.update_layout(
yaxis_title = "m",
xaxis_title = "∆T",
xaxis_rangeslider_visible=True
)
fig.write_html("SST.html")
fig.show()
fig = px.scatter(x = dataset_1["Delta_T"], y = dataset_1["thickness_of_snowfall_amount"], title="Variation of Snowfall Thickness w.r.t. ∆T")
fig.update_layout(
yaxis_title = "m",
xaxis_title ='∆T',
xaxis_rangeslider_visible=True)
fig.show()
From the above figures, we can say that the targeted outputs are concentrated around $\Delta T = 0$, with an approximately normal-looking spread in $\Delta T$.
Now we want to know what kind of distribution $\Delta T$ has and how it differs between no-snow days and snow days. To learn about the variation of average temperature change, let us draw a Kernel Density Estimate (KDE) plot.
dataset_1_nosnow = dataset_1[dataset_1["thickness_of_snowfall_amount"]==0]
fig1 = ff.create_distplot(
[dataset_1_nosnow["Delta_T"]],
['No Snow'],
show_hist=True,
show_rug=True
)
fig1.add_vline(x=0, line_dash="dash", line_color="black")
fig1.update_layout(
yaxis_title = "Density",
xaxis_title ='∆T',
title = "KDE Plot of ∆T for No Snow day",
xaxis_rangeslider_visible=True
)
fig1.write_html("KDE_Delta_T.html")
fig1.show()
From above KDE plot, we see that:
- Distribution Center: Peaks near 0°C $\Delta T$
- Density: ~0.05 to 0.2
- Spread: -5°C to +5°C
This means:
- Most temperature differences cluster near $\Delta T$ = 0 (highest density at center)
- Moderate spread (±5°C temperature variation)
dataset_1_snow = dataset_1[dataset_1["thickness_of_snowfall_amount"]>0]
fig2 = ff.create_distplot(
[dataset_1_snow["Delta_T"]],
['Snow'],
show_hist=True,
show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Density",
xaxis_title ='∆T',
title = "KDE Plot of ∆T for Snow day",
xaxis_rangeslider_visible=True
)
fig2.write_html("KDE_∆T.html")
fig2.show()
From snow day KDE plot, we observe that:
- Distribution Center: Peaks near 0°C Delta_T
- Density: ~0 to 0.18
- Spread: -8°C to +6°C
- Left Skew: Longer tail towards negative values
In addition, the 0°C reference line shows:
- Majority of snow days have some temperature variation
- Completely stable temperatures $\Delta T$ = 0 are rare
So it is evident that no snow days have:
- Narrower distribution (taller peak at 0°C) than snow days and
- Less negative skew
Now we want to quantify the distribution characteristics of $\Delta T$. To do so, we will compute the mean, standard deviation, skewness, and kurtosis with pandas' built-in statistical methods.
snow_stats = {
"mean": dataset_1_snow["Delta_T"].mean(),
"std": dataset_1_snow["Delta_T"].std(),
"skew": dataset_1_snow["Delta_T"].skew(),
"kurtosis": dataset_1_snow["Delta_T"].kurtosis()
}
print("Snow Day Statistics:")
for k, v in snow_stats.items():
    print(f"{k}: {v:.2f}")
# Compare with no-snow days
print("\nProbability of Delta_T < 0:")
print("Snow days:", (dataset_1_snow["Delta_T"] < 0).mean())
print("No-snow days:", (dataset_1_nosnow["Delta_T"] < 0).mean())
Snow Day Statistics: mean: -0.25 std: 2.58 skew: -0.51 kurtosis: 0.52 Probability of Delta_T < 0: Snow days: 0.47440273037542663 No-snow days: 0.4676925929359414
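These summary statistics could be complemented by a formal two-sample test of whether the snow-day and no-snow-day $\Delta T$ distributions differ. A hedged sketch with `scipy.stats.ks_2samp`, shown here on synthetic stand-ins (the real arrays would be `dataset_1_snow["Delta_T"]` and `dataset_1_nosnow["Delta_T"]`):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two Delta_T samples (illustrative only;
# the means/stds roughly mimic the statistics printed above).
snow_dt = rng.normal(loc=-0.25, scale=2.58, size=1500)
nosnow_dt = rng.normal(loc=0.0, scale=2.0, size=8000)

# Two-sample Kolmogorov-Smirnov test: a small p-value would indicate
# that the two empirical distributions differ.
stat, pvalue = ks_2samp(snow_dt, nosnow_dt)
print(f"KS statistic: {stat:.3f}, p-value: {pvalue:.3g}")
```

On the actual notebook data, this would give a quantitative check of the visual impression that the two $\Delta T$ distributions are quite similar.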
From the above test, we see that the probability of a temperature drop ($\Delta T < 0$) is very similar between snow days and no-snow days. This suggests that temperature drops alone are not a strong predictor of snow occurrence in the dataset. Finally, to visualize the correlation among different features and observe the data distribution pattern, let us do a pair plot.
fig = px.scatter_matrix(dataset_1,
dimensions=["precipitation_amount",
"Avg_T",
"thickness_of_snowfall_amount",
"surface_snow_thickness",
],
color="sample", symbol="sample",
title="Scatter matrix of Snow Dataset",
labels={col:col.replace('_', ' ') for col in dataset_1.columns})
fig.update_traces(diagonal_visible=False)
fig.update_layout(title="Pair Plot of Snow Related Atmospheric Dataset",
dragmode='select',
width=1000,
height=1000,
hovermode='closest')
fig.write_html(f"Pair_Plot.html")
fig.show()
Average Temperature Prediction Data Segment Analysis
temp_vars = [ 'precipitation_amount',
'dew_point_temperature',
'air_pressure_at_sea_level',
'surface_air_pressure',
'maximum_air_temperature',
'minimum_air_temperature'
]
dataset_2 = df[temp_vars]
dataset_2.head(3)
| precipitation_amount | dew_point_temperature | air_pressure_at_sea_level | surface_air_pressure | maximum_air_temperature | minimum_air_temperature | |
|---|---|---|---|---|---|---|
| time | ||||||
| 1966-08-01 12:00:00 | 0.0 | NaN | NaN | NaN | 21.1 | 7.8 |
| 1966-08-02 12:00:00 | 0.0 | NaN | NaN | NaN | 23.9 | 13.3 |
| 1966-08-03 12:00:00 | 0.0 | NaN | NaN | NaN | 20.0 | 12.8 |
This dataset's recording period runs from 2006 to the present.
start_date = '2006-01-01'
end_date = '2025-03-24'
dataset_2 = dataset_2.loc[start_date:end_date].copy()
Let us split the dataset into train and test sets. We will train the models on data from 2006 through 2022 and use the remaining data for testing.
# Split data based on year threshold
train_end_time = '2022-12-31'
test_start_time = '2023-01-01'
# Ensure the time column is datetime type (replace 'time' with your actual date column name)
dataset_2.reset_index(inplace=True)
dataset_2['time'] = pd.to_datetime(dataset_2['time'])
dataset_2['sample'] = 'test'
dataset_2.loc[dataset_2['time'] <= train_end_time, 'sample'] = 'train'
dataset_2.head(3)
| time | precipitation_amount | dew_point_temperature | air_pressure_at_sea_level | surface_air_pressure | maximum_air_temperature | minimum_air_temperature | sample | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006-01-01 12:00:00 | 16.500000 | 6.1 | 989.200012 | 981.700012 | 10.600000 | 6.7 | train |
| 1 | 2006-01-02 12:00:00 | 5.100000 | 4.4 | 1002.400024 | 994.900024 | 8.900001 | 2.2 | train |
| 2 | 2006-01-03 12:00:00 | 8.900001 | 3.9 | 1008.799988 | 1001.400024 | 8.300000 | 3.3 | train |
We are not going to plot how these variables change over time; instead, we will check how the precipitation amount changes with the change in surface air pressure.
dataset_2_lag = dataset_2.shift(periods=-1)  # next day's row aligned to the current index
dataset_2["Delta_P"] = dataset_2["surface_air_pressure"] - dataset_2_lag["surface_air_pressure"]  # ΔP between consecutive days
dataset_2_prec = dataset_2[dataset_2["precipitation_amount"]>0].dropna()
fig2 = ff.create_distplot(
[dataset_2_prec["Delta_P"]],
['Precipitation'],
show_hist=True,
show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Probability Density",
xaxis_title ='Delta P',
title = "KDE Plot of Delta P for Precipitation",
xaxis_rangeslider_visible=True
)
fig2.show()
From KDE plot, we observe that:
- Unimodal Distribution
- Distribution Center: Peaks near 0 $\Delta P$
- Density: ~0 to 0.07
- Spread: -20 hPa to +20 hPa
Let's plot the same KDE for the average temperature variation with respect to $\Delta P$.
dataset_2 = (
dataset_2
.assign(Avg_T=lambda x: (x['maximum_air_temperature'] + x['minimum_air_temperature']) / 2)
# .drop(columns=['maximum_air_temperature', 'minimum_air_temperature'])
)
dataset_2_AvgT = dataset_2[dataset_2["Avg_T"]>0].dropna()
fig2 = ff.create_distplot(
[dataset_2_AvgT["Delta_P"]],
['Temperature'],
show_hist=True,
show_rug=True
)
fig2.add_vline(x=0, line_dash="dash", line_color="black")
fig2.update_layout(
yaxis_title = "Probability Density",
xaxis_title ='Delta P',
title = "KDE Plot of Delta P for Average Temperature",
xaxis_rangeslider_visible=True
)
fig2.show()
From KDE plot, we observe that:
- Unimodal Distribution
- Distribution Center: Peaks near 0 $\Delta P$
- Density: ~0 to 0.087
- Spread: -20 hPa to +20 hPa
Now, let us find the Pearson Correlation for this data segment and plot it.
correlation_matrix2 = dataset_2.drop(["sample",
"time",
"Avg_T",
"minimum_air_temperature",
"maximum_air_temperature",
"Delta_P"],
axis=1).corr(method='pearson')
print(correlation_matrix2)
precipitation_amount dew_point_temperature \
precipitation_amount 1.000000 0.060049
dew_point_temperature 0.060049 1.000000
air_pressure_at_sea_level -0.353134 -0.099304
surface_air_pressure -0.358946 -0.092882
air_pressure_at_sea_level surface_air_pressure
precipitation_amount -0.353134 -0.358946
dew_point_temperature -0.099304 -0.092882
air_pressure_at_sea_level 1.000000 0.996697
surface_air_pressure 0.996697 1.000000
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix2, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Pearson Correlation Matrix for Data Segment 2")
plt.show()
From the Pearson correlation, we see that the following features are highly correlated:
- Air Pressure at Sea Level and Surface Air Pressure ($r \approx 0.997$).

So, we will keep only one of them as an input feature for machine learning purposes.
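Such redundant columns could also be dropped automatically above a chosen correlation threshold. A minimal sketch with a hypothetical helper, `drop_highly_correlated` (not part of the notebook), demonstrated on a toy frame:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose absolute Pearson
    correlation exceeds `threshold` (hypothetical helper)."""
    corr = frame.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)

# Toy example: "b" is a near copy of "a", so "b" is removed.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, 2.1, 2.9, 4.2, 5.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
print(drop_highly_correlated(toy).columns.tolist())
```

Applied to the pressure columns here, it would drop `surface_air_pressure` (the later of the two correlated columns), which is the same decision we make manually below.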
Data Processing
Now we want our data to be ready for feeding into the machine learning algorithms, so we need to remove all NaN values from the dataset.
Data Segment 1
# Counting Null Values
print(f"Null Values in Data:\n\n{dataset_1.isnull().sum()}")
# Remove rows with ANY null values (default behavior)
dataset_1 = dataset_1.dropna()
print(f"Null Values in Data:\n\n{dataset_1.isnull().sum()}")
Null Values in Data: time 0 precipitation_amount 0 maximum_air_temperature 0 minimum_air_temperature 0 thickness_of_snowfall_amount 1 surface_snow_thickness 1 sample 0 Avg_T 0 Delta_T 1 dtype: int64 Null Values in Data: time 0 precipitation_amount 0 maximum_air_temperature 0 minimum_air_temperature 0 thickness_of_snowfall_amount 0 surface_snow_thickness 0 sample 0 Avg_T 0 Delta_T 0 dtype: int64
Data Segment 2
print(f"Null Values in Data:\n\n{dataset_2.isnull().sum()}")
dataset_2 = dataset_2.dropna()
print(f"Null Values in Data:\n\n{dataset_2.isnull().sum()}")
Null Values in Data: time 0 precipitation_amount 9 dew_point_temperature 333 air_pressure_at_sea_level 318 surface_air_pressure 318 maximum_air_temperature 9 minimum_air_temperature 5 sample 0 Delta_P 356 Avg_T 10 dtype: int64 Null Values in Data: time 0 precipitation_amount 0 dew_point_temperature 0 air_pressure_at_sea_level 0 surface_air_pressure 0 maximum_air_temperature 0 minimum_air_temperature 0 sample 0 Delta_P 0 Avg_T 0 dtype: int64
Now, we will make a list of inputs and outputs for both data segments.
# Define input features and target variables
inputs_1 = ['precipitation_amount', 'Avg_T']
targets_1 = ['thickness_of_snowfall_amount', 'surface_snow_thickness']
inputs_2 = [ 'precipitation_amount',
'dew_point_temperature',
'surface_air_pressure'
]
targets_2 = ['Avg_T']
We split the dataset into train and test.
# Data Segment 1: Split into train/test based on 'sample' column
X_train_1 = dataset_1.loc[dataset_1['sample'] == 'train', inputs_1]
X_test_1 = dataset_1.loc[dataset_1['sample'] == 'test', inputs_1]
y_train_1 = dataset_1.loc[dataset_1['sample'] == 'train', targets_1]
y_test_1 = dataset_1.loc[dataset_1['sample'] == 'test', targets_1]
# Data Segment 2
X_train_2 = dataset_2.loc[dataset_2['sample'] == 'train', inputs_2]
X_test_2 = dataset_2.loc[dataset_2['sample'] == 'test', inputs_2]
y_train_2 = dataset_2.loc[dataset_2['sample'] == 'train', targets_2]
y_test_2 = dataset_2.loc[dataset_2['sample'] == 'test', targets_2]
print(X_train_1.head(3), "\n")
print(y_train_1.head(3), "\n\n")
print(X_train_2.head(3), "\n")
print(y_train_2.head(3))
precipitation_amount Avg_T
0 0.0 14.450001
1 0.0 18.600000
2 0.0 16.400000
thickness_of_snowfall_amount surface_snow_thickness
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
precipitation_amount dew_point_temperature surface_air_pressure
0 16.500000 6.1 981.700012
1 5.100000 4.4 994.900024
2 8.900001 3.9 1001.400024
Avg_T
0 8.650001
1 5.550000
2 5.800000
Let us check the shapes of all train and test sets:
print(f"***********************************************************",
f"\nData Segment 1",
f"\n***********************************************************",
f"\nShape of Input Train Data: {X_train_1.shape}",
f"\nShape of Targets Train Data: {y_train_1.shape}",
f"\nShape of Input Test Data: {X_test_1.shape}",
f"\nShape of Target Test Data: {y_test_1.shape}\n"
)
print(f"***********************************************************",
f"\nData Segment 2",
f"\n***********************************************************",
f"\nShape of Input Train Data: {X_train_2.shape}",
f"\nShape of Target Train Data: {y_train_2.shape}",
f"\nShape of Input Test Data: {X_test_2.shape}",
f"\nShape of Target Test Data: {y_test_2.shape}"
)
*********************************************************** Data Segment 1 *********************************************************** Shape of Input Train Data: (9649, 2) Shape of Targets Train Data: (9649, 2) Shape of Input Test Data: (1431, 2) Shape of Target Test Data: (1431, 2) *********************************************************** Data Segment 2 *********************************************************** Shape of Input Train Data: (6039, 3) Shape of Target Train Data: (6039, 1) Shape of Input Test Data: (598, 3) Shape of Target Test Data: (598, 1)
ML Model Development and Learning
Regression refers to a predictive modeling problem that involves predicting a numerical value. Some regression machine learning algorithms support multiple outputs directly.
This includes several of the popular algorithms implemented in the scikit-learn library, such as:
- K-Neighbors Regressor and
- Decision Tree Regressor

Single-output estimators such as Linear Regression can be extended to multiple targets by wrapping them in scikit-learn's Multi-Output Regressor. We will feed our dataset into all these regressors and record their training time, mean squared error, and $R^2$ score on the train and test sets.
Importing ML Models:
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
import time
# Initialize models
models_1 = {
"MultiOutput_LinearRegression": MultiOutputRegressor(LinearRegression()),
"DecisionTree": DecisionTreeRegressor(),
"KNeighbors": KNeighborsRegressor()
}
We will write a loop to train our models on the data segments and store the performance metrics for each model. First, let us initialize a dictionary of empty lists to store the training time, mean squared errors, and $R^2$ scores for the train and test sets.
# Dictionary to store all MSE scores
model_stats_1 = {
'Training_Time' : [],
'Model': [],
'Train_MSE': [],
'Test_MSE': [],
'train_R²_Score' : [],
'test_R²_Score' : []
}
Training Data Segment 1 to Predict Snowfall Amount and Surface Snow Thickness
for model_name, model in models_1.items():
    start = time.time()
    # Train model
    model.fit(X_train_1, y_train_1)
    end = time.time()
    model_stats_1['Training_Time'].append(end - start)
    # Get predictions
    train_pred = model.predict(X_train_1)
    test_pred = model.predict(X_test_1)
    # Calculate MSE (averaged over both outputs)
    train_mse = mean_squared_error(y_train_1, train_pred)
    test_mse = mean_squared_error(y_test_1, test_pred)
    model_stats_1['Train_MSE'].append(train_mse)
    model_stats_1['Test_MSE'].append(test_mse)
    # R² scores on train and test sets
    train_r2 = model.score(X_train_1, y_train_1)
    test_r2 = model.score(X_test_1, y_test_1)
    model_stats_1['train_R²_Score'].append(train_r2)
    model_stats_1['test_R²_Score'].append(test_r2)
    model_stats_1['Model'].append(model_name)
Training Data Segment 2 for Predicting Average Temperature
# Initialize models
models_2 = {
"LinearRegression": LinearRegression(),
"DecisionTree": DecisionTreeRegressor(),
"KNeighbors": KNeighborsRegressor()
}
# Dictionary to store all MSE scores
model_stats_2 = {
'Training_Time' : [],
'Model': [],
'Train_MSE': [],
'Test_MSE': [],
'train_R²_Score' : [],
'test_R²_Score' : []
}
for model_name, model in models_2.items():
    start = time.time()
    # Train model
    model.fit(X_train_2, y_train_2)
    end = time.time()
    model_stats_2['Training_Time'].append(end - start)
    # Get predictions
    train_pred = model.predict(X_train_2)
    test_pred = model.predict(X_test_2)
    # Calculate MSE
    train_mse = mean_squared_error(y_train_2, train_pred)
    test_mse = mean_squared_error(y_test_2, test_pred)
    model_stats_2['Train_MSE'].append(train_mse)
    model_stats_2['Test_MSE'].append(test_mse)
    # R² scores on train and test sets
    train_r2 = model.score(X_train_2, y_train_2)
    test_r2 = model.score(X_test_2, y_test_2)
    model_stats_2['train_R²_Score'].append(train_r2)
    model_stats_2['test_R²_Score'].append(test_r2)
    model_stats_2['Model'].append(model_name)
Model Statistics
# Convert to DataFrame for better visualization
Model_Statistics_1 = pd.DataFrame(model_stats_1).set_index('Model')
Model_Statistics_2 = pd.DataFrame(model_stats_2).set_index('Model')
print(f"*****************************",
f"\nStatistics for Data Segment 1",
f"\n*****************************\n")
Model_Statistics_1.head()
***************************** Statistics for Data Segment 1 *****************************
| Training_Time | Train_MSE | Test_MSE | train_R²_Score | test_R²_Score | |
|---|---|---|---|---|---|
| Model | |||||
| MultiOutput_LinearRegression | 0.00518 | 0.000193 | 0.000040 | 0.044296 | 0.044296 |
| DecisionTree | 0.00518 | 0.000018 | 0.000080 | 0.933264 | 0.933264 |
| KNeighbors | 0.00518 | 0.000108 | 0.000036 | 0.519766 | 0.519766 |
From the above statistics, the Decision Tree attains by far the lowest training MSE and the highest $R^2$ score. However, its test MSE (0.000080) is actually the highest of the three models, a sign of overfitting, while KNeighbors gives the lowest test MSE. Training times for all three models are negligible on this dataset.
print(f"*****************************",
f"\nStatistics for Data Segment 2",
f"\n*****************************\n")
Model_Statistics_2.head()
***************************** Statistics for Data Segment 2 *****************************
| Training_Time | Train_MSE | Test_MSE | train_R²_Score | test_R²_Score | |
|---|---|---|---|---|---|
| Model | |||||
| LinearRegression | 0.005213 | 4.613514 | 5.032811 | 0.795707 | 0.795707 |
| DecisionTree | 0.005213 | 1.094333 | 5.568324 | 0.951541 | 0.951541 |
| KNeighbors | 0.005213 | 2.863456 | 4.624170 | 0.873202 | 0.873202 |
From the above statistics, Linear Regression performs worst on the training data. The Decision Tree again provides the lowest training MSE and the highest $R^2$ score, but its test MSE is the largest of the three, suggesting overfitting; KNeighbors achieves the best test MSE.
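A single train/test split can make such comparisons fragile; cross-validation would give a less split-dependent estimate. A hedged sketch using scikit-learn's `cross_val_score` on synthetic data (the real call would use `X_train_2` and `y_train_2`; the feature/target relationship below is made up):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in for the segment-2 inputs and target (illustrative only).
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

for name, model in [("DecisionTree", DecisionTreeRegressor(random_state=0)),
                    ("KNeighbors", KNeighborsRegressor())]:
    # 5-fold cross-validated R² averages out the luck of any single split.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean R² = {scores.mean():.3f} ± {scores.std():.3f}")
```

For time-ordered data like ours, `sklearn.model_selection.TimeSeriesSplit` would respect chronology better than plain k-fold and would be the natural choice for a follow-up.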
Conclusion
We aimed to analyze the Global Historical Climatology Network's daily data and learn the relations among different features. We divided the data into two segments and fed them to different ML models. For the first segment, the inputs were average temperature and precipitation amount, and the targets were surface snow thickness and snowfall thickness. The Decision Tree model achieved the best fit here (train MSE 0.000018, $R^2$ 0.9333), although its test MSE (0.000080) was the highest of the three models. In the second data segment, the inputs were precipitation amount, dew point temperature, and surface air pressure, and the target was average temperature. This time, KNeighbors generalized best, with train MSE 2.8635, test MSE 4.6242, and $R^2$ 0.8732.